Stack Overflow Documentation Data Dump
- Publication date
- 2017-09-08
- Topics
- stackoverflow documentation
- Collection
- opensource
- Language
- English
- Addeddate
- 2017-09-08 17:40:39
- Identifier
- documentation-dump.7z
- Identifier-ark
- ark:/13960/t0jt5vn75
- Scanner
- Internet Archive HTML5 Uploader 1.6.3
- Year
- 2017
Reviews
Subject: Created a wiki website
Subject: Relocation
Subject: Website created from the SO Data Dump
It is currently read-only, but I plan to turn it into a wiki so that people can edit examples and embed live examples using a fiddle such as .NET Fiddle, SQL Fiddle or JS Fiddle, which I believe was sorely missing from an “example first” documentation site.
Subject: Easy to use but unusable dates
I found it relatively easy to extract the data I wanted from the archive. My only serious criticism is the date format, which is obscure to non-Unix users. I do not need dates but, if they are thought to be important, I believe the archive should be recreated with dates in a standard format such as ISO 8601.
Detail
I had started writing an introduction to Outlook VBA within Stack Overflow Documentation, which I did not wish to lose. I had started extracting my text from the website but had not finished when the documentation was taken down, so to recover the rest of my text I would have to extract it from the archive.
File “documentation-dump.7z” was easy to download. WinZip extracted its contents just as easily. No doubt, your favourite extraction utility will work just as effectively.
The file “readme.txt” seemed the obvious starting point. This file lists the other files, each with what looks like a list of field names. There is no other documentation that I have found. I have decoded the files of interest to me without difficulty, so perhaps no more documentation was necessary. On the other hand, I have had much practice at decoding undocumented files, so others may find this lack of documentation more troubling.
Most of the files had an extension of “json” which meant nothing to me. A search for “json” found http://json.org/ which provided an adequate definition of the format of JavaScript Object Notation which is a lightweight data-interchange format. Before the days of XML I was a specialist in electronic data interchange so again I may not be the best judge of the adequacy of this definition.
Starting with the first file, “contributors.json”, I found:
[
{
"Id": 1,
"DocTopicId": 1,
"UserId": 80572,
"DocContributorTypeId": 2,
"CreationDate": "\/Date(1446697142040-0500)\/"
},
{
"Id": 2,
and so on
With my newly acquired knowledge of this format, I knew that “{name/value pair, name/value pair, …}” was an object and that “[” was the start of an array, so the file was an array of these simple objects. The names and values looked obvious enough, with the one exception of "\/Date(1446697142040-0500)\/".
When searching for “json”, “json date format” is the first suggestion. Apparently, "\/Date(1446697142040)\/" is milliseconds since 0:00 on 1 January 1970. I can find nothing to explain “-0500” which seems to be a private SO addition; I assume it is something to do with a time zone. Apparently, JSON does not define date formats and conventions have changed over the years. However, a Stack Overflow answer with a score of 1140 recommends ISO 8601’s format: “2012-04-23T18:25:43.511Z”. Apparently this format is endorsed by everyone that matters. The only reason for considering milliseconds since 1970 seems to be that it is a standard within Unix and even the oldest libraries have routines that can read it. This is not a standard that appeals to non-Unix users. I do not know if dates are important to anyone; certainly they are not important to me. If they are thought to be important, I believe the archive should be recreated with dates in ISO 8601 format so everyone can use them.
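For anyone who does want usable dates, the conversion to ISO 8601 can be sketched in a few lines of Python. This is a minimal sketch, not part of the archive: the function names are hypothetical, and it rests on two assumptions, namely that the millisecond count is measured in UTC and that the “-0500” suffix is a UTC offset in ±HHMM form (it appears to match the suffix that Microsoft's older JSON serializers append to this /Date(...)/ form).

```python
import re
from datetime import datetime, timedelta, timezone

# Matches "/Date(<milliseconds><optional +/-HHMM>)/" once JSON unescaping
# has turned each "\/" into "/".
_DATE_RE = re.compile(r"^/Date\((-?\d+)([+-]\d{4})?\)/$")
_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_dump_date(text):
    """Convert a dump date string to a timezone-aware datetime."""
    match = _DATE_RE.match(text)
    if match is None:
        raise ValueError(f"not a /Date(...)/ value: {text!r}")
    millis, offset = match.groups()
    # Assumption: the millisecond count is measured from 1970-01-01 UTC.
    instant = _EPOCH + timedelta(milliseconds=int(millis))
    if offset is None:
        return instant
    # Assumption: the suffix is a UTC offset written as +/-HHMM.
    sign = -1 if offset[0] == "-" else 1
    delta = timedelta(hours=int(offset[1:3]), minutes=int(offset[3:5]))
    return instant.astimezone(timezone(sign * delta))


def to_iso8601(text):
    """Render a dump date as ISO 8601 text with millisecond precision."""
    return parse_dump_date(text).isoformat(timespec="milliseconds")


iso_text = to_iso8601("/Date(1446697142040-0500)/")
```

Under those assumptions the offset only changes how the instant is displayed, not which instant it is, so dropping the suffix and treating the number as UTC should give the same point in time.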
I searched “contributors.json” for my user id and found enough occurrences to match the number of examples I had written.
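The same search can be sketched in Python for readers who prefer it to a text search. The sample below is illustrative only: the first record is the one shown above, the second record and the helper name records_for_user are made up, and a real run would parse the contents of “contributors.json” instead of the embedded string.

```python
import json

# Two records in the shape shown above. The second is invented for the
# example. The raw string keeps the "\/" escapes as they appear on disk.
SAMPLE = r'''
[
  {"Id": 1, "DocTopicId": 1, "UserId": 80572,
   "DocContributorTypeId": 2,
   "CreationDate": "\/Date(1446697142040-0500)\/"},
  {"Id": 2, "DocExampleId": 7, "UserId": 12345,
   "DocContributorTypeId": 2,
   "CreationDate": "\/Date(1446697142040-0500)\/"}
]
'''


def records_for_user(records, user_id):
    """Return the objects whose "UserId" equals user_id."""
    return [rec for rec in records if rec.get("UserId") == user_id]


records = json.loads(SAMPLE)   # a list of dicts, one per JSON object
mine = records_for_user(records, 80572)
```

Note that the JSON parser turns each "\/" into a plain "/", so the parsed date value reads /Date(1446697142040-0500)/ without the backslashes.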
I tend to use Excel and VBA for this type of investigation. VBA is an adequate language and Excel worksheets are a convenient repository for poorly understood data. I wrote code to extract each object containing my user id and save it as a row within an Excel worksheet. The first few rows and columns of that worksheet contain:
Row|  A  |    B     |     C      |  D   |         E          |            F             |
---+-----+----------+------------+------+--------------------+--------------------------|
  1|   Id|DocTopicId|DocExampleId|UserId|DocContributorTypeId|CreationDate              |
---+-----+----------+------------+------+--------------------+--------------------------|
  2|79143|          |       26136|973283|                   2|/Date(1484774756887-0500)/|
---+-----+----------+------------+------+--------------------+--------------------------|
  3|79144|      8111|            |973283|                   2|/Date(1484774756887-0500)/|
---+-----+----------+------------+------+--------------------+--------------------------|
  4|79145|          |       27558|973283|                   2|/Date(1484774756887-0500)/|
---+-----+----------+------------+------+--------------------+--------------------------|
  5|79350|          |       27628|973283|                   2|/Date(1485051525857-0500)/|
---+-----+----------+------------+------+--------------------+--------------------------|
Sorry if the above is difficult to read. SO's pre-formatting of text does not work here.
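A worksheet like this can also be produced without VBA by writing the selected objects to a CSV file that Excel will open. The following is a rough sketch rather than the code I used: the function name records_to_csv is hypothetical, and it assumes that a field such as DocTopicId is simply absent from objects that do not use it, which is what the blank cells suggest.

```python
import csv
import io


def records_to_csv(records):
    """Write a list of flat JSON objects as CSV text, one row per object."""
    # Collect column names in first-seen order, because not every object
    # carries every field.
    columns = []
    for rec in records:
        for name in rec:
            if name not in columns:
                columns.append(name)
    buffer = io.StringIO(newline="")
    # restval="" leaves the cell blank when an object lacks a field.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Two of the rows shown above, trimmed to four fields for brevity.
rows = [
    {"Id": 79143, "DocExampleId": 26136, "UserId": 973283},
    {"Id": 79144, "DocTopicId": 8111, "UserId": 973283},
]
csv_text = records_to_csv(rows)
```

Saving csv_text to a file with a “.csv” extension gives one worksheet row per object, with blank cells where a field is missing.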
The column “DocExampleId” looked interesting so I looked at the file “examples.json” which starts:
[
{
"Id": 1,
"DocTopicId": 1,
"Title": "Basic Usage",
"CreationDate": "\/Date(1446697142040-0500)\/",
"LastEditDate": "\/Date(1469351669667-0400)\/",
"Score": 6,
"ContributorCount": 2,
"BodyHtml": "using StackExchange.Redis;\r\n\r\n// ...\r\n\r\n// connect to the server\r\nConnectionMultiplexer connection = ConnectionMultiplexer.Connect("localhost");\r\n\r\n// select a database (by default, DB = 0)\r\nIDatabase db = connection.GetDatabase();\r\n\r\n// run a command, in this case a GET\r\nRedisValue myVal = db.StringGet("mykey");\r\n\r\n\r\n",
"BodyMarkdown": " using StackExchange.Redis;\n\n // ...\n\n // connect to the server\n ConnectionMultiplexer connection = ConnectionMultiplexer.Connect(\"localhost\");\n \n // select a database (by default, DB = 0)\n IDatabase db = connection.GetDatabase();\n\n // run a command, in this case a GET\n RedisValue myVal = db.StringGet(\"mykey\");",
"IsPinned": false
},
{
"Id": 2,
"DocTopicId": 2,
Again this is an array of objects, with “BodyHtml” perhaps holding the text I sought. I searched for one of the DocExampleIds recorded against my UserId and found the HTML I sought.
I wrote code to extract each object containing a DocExampleId for which I was listed as a contributor, and saved each one as a row within another Excel worksheet. This worksheet contained all the text I wanted together with links to my images.
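The two-step lookup described above can be sketched as follows. The helper name examples_for_user and the sample records are hypothetical, and the sketch assumes that a “DocExampleId” in “contributors.json” refers to the “Id” of an object in “examples.json”, which is consistent with what I found but is not stated in “readme.txt”.

```python
def examples_for_user(contributors, examples, user_id):
    """Return the example objects that list user_id as a contributor."""
    # Assumption: "DocExampleId" in a contributor record refers to the
    # "Id" of an example record.
    wanted = {
        rec["DocExampleId"]
        for rec in contributors
        if rec.get("UserId") == user_id
        and rec.get("DocExampleId") is not None
    }
    return [ex for ex in examples if ex.get("Id") in wanted]


# Made-up records in the shape of the two files.
contributors = [
    {"Id": 1, "DocExampleId": 10, "UserId": 111},
    {"Id": 2, "DocTopicId": 5, "UserId": 111},
    {"Id": 3, "DocExampleId": 11, "UserId": 222},
]
examples = [
    {"Id": 10, "Title": "First", "BodyHtml": "one"},
    {"Id": 11, "Title": "Second", "BodyHtml": "two"},
]
found = examples_for_user(contributors, examples, 111)
```

Building the set of wanted ids first means the large examples list is scanned only once.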
I had problems reading “examples.json”, which is 92 MB and uses UTF-8 encoding. I suspect my problems are unique to VBA. If you are interested in these problems, see: https://stackoverflow.com/q/46838258/973283.
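For comparison, in a language with built-in UTF-8 support the read is short. This sketch does not reproduce the VBA difficulty; it only shows a file of this kind being read with the encoding stated explicitly. The function name load_json_file is hypothetical, and the small temporary file stands in for “examples.json”. Whether the real file starts with a byte-order mark is not something I checked, so the sketch uses a codec that accepts the file either way.

```python
import json
import os
import tempfile


def load_json_file(path):
    """Parse a whole JSON file that is encoded as UTF-8."""
    # "utf-8-sig" decodes plain UTF-8 and also skips a leading
    # byte-order mark if one is present.
    with open(path, encoding="utf-8-sig") as handle:
        return json.load(handle)


# Stand-in for "examples.json": a tiny UTF-8 file with a byte-order mark
# and a non-ASCII character in it.
raw = "\ufeff" + json.dumps([{"Id": 1, "Title": "Café"}], ensure_ascii=False)
with tempfile.NamedTemporaryFile("wb", suffix=".json", delete=False) as tmp:
    tmp.write(raw.encode("utf-8"))
    tmp_path = tmp.name

loaded = load_json_file(tmp_path)
os.remove(tmp_path)
```

A file of 92 MB is parsed in one go here, which should be workable on an ordinary desktop machine but does hold the whole array in memory.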
Conclusion
The summary section is my review of the SO documentation archive. The detail section is intended to justify my review and to help and encourage anyone who wishes to extract information from this archive but is intimidated by the format.
IN COLLECTIONS
Community Texts
Community Collections
Uploaded by Stack Exchange on